Abstract

Since its inception, VLSI theory has expanded in many fruitful and
interesting directions. One major branch is layout theory, which studies
the efficiency with which graphs can be embedded in the plane according
to VLSI design rules. In this survey paper, I review some of the
major accomplishments of VLSI layout theory and discuss how layout
theory engendered the notion of area- and volume-universal networks,
such as fat-trees. These scalable networks offer a flexible alternative
to the more common hypercube-based networks for interconnecting
the processors of large parallel supercomputers. (This paper was an
invited presentation at the 1989 Caltech Decennial VLSI Conference.)

Keywords: Fat-trees, hypercubes, integrated circuits, interconnection
networks, layout theory, parallel computing, supercomputing,
universality, Thompson’s model, tree-of-meshes, VLSI theory.




Ten years ago, Clark Thompson introduced a simple, graph-theoretic
model for VLSI circuitry [22]. In Thompson’s model, a circuit is a graph
whose vertices correspond to active circuit elements and whose edges
correspond to wires. A VLSI layout is a mapping of the graph to a two-dimensional
grid, such that each vertex is mapped to a square region of the grid and each
edge is mapped to a path in the grid. Unlike the classical notions of a graph
embedding from mathematics, Thompson’s model allows edges of a graph to
cross over one another, like wires on an integrated circuit.

The interesting cost measure in VLSI is area. In Thompson’s model,
area can be measured as the number of grid points occupied by edges or
vertices of the graph. The minimum-area layouts for familiar graphs
were quickly catalogued. As shown in Figure 1, a mesh (two-dimensional array)
with n vertices (√n by √n) has Θ(n) area.¹ The normal way of drawing a
complete binary tree (Figure 2a) has Θ(n lg n) area, but the “H-tree” layout
(Figure 2b) is much better: it has Θ(n) area. A hypercube, which is a popular
interconnection network for parallel computers, requires considerably more
area: Θ(n²).
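The gap between the two tree layouts can be checked with a little arithmetic. The sketch below is only illustrative: it assumes each vertex occupies one grid cell, that the standard drawing uses one column per vertex and one row per level, and that an H-tree of even depth is built from four quadrant subtrees separated by one middle row and one middle column. The function names are hypothetical, and the constants are assumptions rather than anything stated in the text.

```python
def num_vertices(depth):
    """Vertices in a complete binary tree whose leaves are at the given depth."""
    return 2 ** (depth + 1) - 1


def standard_tree_area(depth):
    """Area of the usual drawing: one column per vertex, one row per level."""
    width = num_vertices(depth)
    height = depth + 1
    return width * height


def h_tree_side(depth):
    """Side of a square H-tree layout for an even depth (assumed recurrence).

    Four subtrees of depth - 2 sit in the quadrants; one extra row and one
    extra column carry the root, its two children, and their wires.
    """
    if depth < 0 or depth % 2 != 0:
        raise ValueError("this sketch handles even, non-negative depths only")
    side = 1
    for _ in range(depth // 2):
        side = 2 * side + 1
    return side


def h_tree_area(depth):
    return h_tree_side(depth) ** 2


if __name__ == "__main__":
    for depth in range(0, 13, 2):
        n = num_vertices(depth)
        print(n, standard_tree_area(depth), h_tree_area(depth))
```

Under these assumptions the H-tree area stays below 2n, while the standard drawing grows as n times the number of levels.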

What causes a hypercube to occupy so much area? Although the size of a
vertex grows slowly with the number of vertices in a hypercube, most of the
area of a hypercube layout is devoted to wires. Figure 3 shows how the
problem of wiring a hypercube grows with the size of the hypercube. Wires
are expensive, and wire area represents the capital cost of communication
on a VLSI chip. By measuring communication costs in terms of the
geometric concept of area, Thompson’s model enabled a mathematical theory
of communication in VLSI systems to develop.

From its origin, VLSI theory has expanded in many fruitful and
interesting directions. Rather than attempting to describe the breadth of research
in VLSI theory, however, I would like to revisit the accomplishments along
one narrow path, layout theory, which I believe will have a fundamental
impact on the architecture of large parallel supercomputers.

¹ The notation Θ(f(n)) means a function that grows at the same rate as f(n) to within a
constant factor as n becomes large. The notation O(f(n)) means a function that grows no
more quickly, and Ω(f(n)) means a function that grows no more slowly. Formal definitions
for these terms can be found in any textbook on analysis of computer algorithms.

Figure 1: A mesh (two-dimensional array) on n vertices has a VLSI layout
with O(n) area.

Figure 2: A complete binary tree on n vertices laid out in the standard way
(a) takes O(n lg n) area, but an H-tree layout (b) requires only O(n) area.

Figure 3: Illustrations (not layouts) of hypercubes on 4, 8, 16, and 32 vertices.
Any layout of an n-vertex hypercube requires Ω(n²) area.

In his early work, Thompson discovered an important lower bound. The
area of an n-vertex graph is related to its bisection width: the minimum
number of edges that must be removed to partition the graph into two subgraphs
of n/2 vertices (to within 1, if the number of vertices is odd). For example,
an n-vertex mesh has a bisection width of √n. A complete binary tree has a
bisection width of 1. A hypercube has a bisection width of n/2. Thompson
proved that any layout of a graph with bisection width w requires Ω(w²)
area.
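The three bisection widths quoted above can be confirmed on small instances by exhaustive search. The following is a minimal sketch, not an efficient algorithm: it tries every balanced split, so it is only usable for a dozen or so vertices. The helper names and the vertex numbering are my own assumptions, and the lower-bound function drops the constant factor hidden in the Ω notation.

```python
from itertools import combinations


def bisection_width(num_vertices, edges):
    """Minimum number of edges cut by any split into halves (to within 1)."""
    best = len(edges)
    for subset in combinations(range(num_vertices), num_vertices // 2):
        side = set(subset)
        cut = sum((u in side) != (v in side) for u, v in edges)
        best = min(best, cut)
    return best


def mesh_edges(side):
    """Edges of a side-by-side mesh; vertex (r, c) is numbered r * side + c."""
    edges = []
    for r in range(side):
        for c in range(side):
            v = r * side + c
            if c + 1 < side:
                edges.append((v, v + 1))
            if r + 1 < side:
                edges.append((v, v + side))
    return edges


def complete_binary_tree_edges(n):
    """Edges of a complete binary tree stored in heap order (parent of i is (i-1)//2)."""
    return [((i - 1) // 2, i) for i in range(1, n)]


def hypercube_edges(dim):
    """Edges of a hypercube on 2**dim vertices; neighbors differ in one bit."""
    return [(u, u ^ (1 << b))
            for u in range(1 << dim)
            for b in range(dim)
            if u < u ^ (1 << b)]


def area_lower_bound(width):
    """Thompson's bound with the constant dropped: area grows at least as w**2."""
    return width * width


if __name__ == "__main__":
    print(bisection_width(16, mesh_edges(4)))                    # mesh
    print(bisection_width(15, complete_binary_tree_edges(15)))   # tree
    print(bisection_width(16, hypercube_edges(4)))               # hypercube
```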

It turns out that a small bisection width does not lead immediately to a
small-area layout. After all, if we take two n/2-vertex subgraphs, each with
Θ(n²) area, and connect them by a single edge, the resulting graph has a
bisection width of 1 but still requires Θ(n²) area. Leslie Valiant and I were
able to show in independent work [24, 14, 15], however, that if there is a good
recursive decomposition of a graph, one where we can keep subdividing the
subgraphs without cutting many edges, then the graph has a small layout.
For example, not only complete binary trees, but any binary tree, no matter
how badly balanced, can be laid out in O(n) area by a divide-and-conquer
method. Valiant and I were also able to show that this method lays out any
n-vertex planar graph in O(n lg² n) area. Later, Leighton was able to show
that a variant of our method was optimal on any graph to within an O(lg² n)
factor in area [9].

Figure 4: The tree-of-meshes graph.

Leighton also introduced an interesting graph which he called the
tree-of-meshes graph, shown in Figure 4. He was able to prove that this graph
requires Ω(n lg n) area, thereby refuting a conjecture of mine that all planar
graphs could be laid out in O(n) area. It remains an open question in VLSI
theory as to whether there exists a planar graph that requires Ω(n lg² n) area,
or if all planar graphs can be laid out in O(n lg n) area.

Numerous other results in layout theory have been obtained, too many to
mention them all. Paterson, Ruzzo, and Snyder [18] and Bhatt and Leiserson
[3] studied how to keep wires short while preserving small area. Valiant [24],
Ruzzo and Snyder [21], and Dolev, Leighton, and Trickey [4] studied VLSI
layouts in which wires are not allowed to cross. Three-dimensional integration
was studied by Rosenberg [20], Leighton and Rosenberg [13], and Greenberg
and Leiserson [5]. Fault tolerance in wafer-scale circuits was studied by
Rosenberg [19], Leighton and Leiserson [11, 10], and Greene and El Gamal
[7]. The packaging of graphs into chips was studied by Leiserson [15] and
Bhatt and Leiserson [2].

In fact, packaging constraints are analogous to the constraints in
Thompson’s model. At any level of packaging (chips, boards, backplanes, racks,
or cabinets), manufacturing technology constrains the number of external
connections from a package to be much smaller than the number of
components within the package. In Thompson’s model, a square region with side s
can support 4s external connections, but it can contain s² vertices, which is
considerably larger than 4s as s becomes large.

As  an  example  of  a  result  [15]  in  packaging,  Figure  5  shows  a  novel  way 
to  package  a  complete  binary  tree  using  4-pin  packages  of  a  single  type.  Each 
chip  contains  one  internal  node  of  the  tree,  with  three  external  connections, 
and  the  remainder  of  the  chip  is  packed  as  full  as  possible  with  a  complete 
binary  tree,  with  one  external  connection.  To  assemble  a  tree  with  twice  as 
many  leaves,  we  use  two  chips.  We  wire  up  one  of  the  unconnected  internal 
nodes  on  one  of  the  chips  as  the  parent  of  the  two  complete  binary  trees. 
We  are  left  with  a  complete  binary  tree  with  twice  as  many  leaves,  plus  one 
unconnected  internal  node.  Thus,  considering  the  two  chips  as  a  single  unit, 
the  structure  is  the  same  as  the  one  with  which  we  began.  By  repeating 
the  process,  we  can  recursively  assemble  a  complete  binary  tree  of  arbitrarily 
large  size. 
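The recursive assembly just described can be modeled in a few lines. This is a hedged sketch rather than the construction from [15] itself: trees are represented as nested tuples, the class and function names are hypothetical, and the pin count is tracked as the invariant "one link above the tree root plus three links on the single spare node".

```python
from dataclasses import dataclass

LEAF = "leaf"


def complete_tree(depth):
    """Nested-tuple model of a complete binary tree of the given depth."""
    if depth == 0:
        return LEAF
    return (complete_tree(depth - 1), complete_tree(depth - 1))


def depth_if_complete(tree):
    """Depth of the tree if it is complete, otherwise None."""
    if tree == LEAF:
        return 0
    left, right = depth_if_complete(tree[0]), depth_if_complete(tree[1])
    if left is None or left != right:
        return None
    return left + 1


def count_leaves(tree):
    return 1 if tree == LEAF else count_leaves(tree[0]) + count_leaves(tree[1])


@dataclass(frozen=True)
class Unit:
    """One or more chips holding a complete tree plus one spare internal node."""
    tree: object
    chips: int

    @property
    def external_pins(self):
        # One pin above the tree root, three pins on the spare node.
        return 1 + 3


def make_chip(depth):
    return Unit(complete_tree(depth), chips=1)


def combine(a, b):
    """Use a's spare node as the parent of both trees; b's spare node stays free."""
    if depth_if_complete(a.tree) != depth_if_complete(b.tree):
        raise ValueError("both units must hold complete trees of equal depth")
    return Unit((a.tree, b.tree), a.chips + b.chips)


if __name__ == "__main__":
    unit = make_chip(2)
    for _ in range(3):
        unit = combine(unit, unit)  # stands for two identical units
    print(unit.chips, count_leaves(unit.tree), unit.external_pins)
```

Each doubling consumes one spare node and leaves one behind, so the combined unit presents the same four external connections as a single chip.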

The work in layout theory culminated with the development by Bhatt
and Leighton [1] of a general framework for VLSI layout. They proposed a
layout method with which they were able to obtain optimal or near-optimal
layouts for many graph-embedding problems. Their method has three steps.
First, recursively bisect the graph, forming a decomposition tree of the graph.
Second, embed the graph in the tree-of-meshes graph (Figure 4), typically
with the vertices of the graph at the leaves of the tree-of-meshes graph. The
meshes in the tree-of-meshes are used as crossbar switches for routing the
edges of the graph. Third, obtain the layout of the graph by looking at
where the vertices and edges are mapped when the tree-of-meshes graph
is laid out according to known good layouts.
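Only the first of these steps lends itself to a short illustration. The sketch below builds a decomposition tree by repeatedly halving a given vertex ordering and counting the edges each split cuts. Splitting by position in the ordering is my simplifying assumption; finding good bisections is the hard part of the real method and is not attempted here. The dictionary layout and function names are hypothetical.

```python
def decomposition_tree(vertices, edges):
    """Recursively halve `vertices`, recording how many edges each split cuts.

    `edges` must only join vertices that appear in `vertices`.
    """
    node = {"size": len(vertices), "cut": 0, "children": []}
    if len(vertices) <= 1:
        return node
    mid = len(vertices) // 2
    left, right = vertices[:mid], vertices[mid:]
    left_set = set(left)
    inside_left, inside_right = [], []
    for u, v in edges:
        if (u in left_set) != (v in left_set):
            node["cut"] += 1              # edge crosses this bisection
        elif u in left_set:
            inside_left.append((u, v))
        else:
            inside_right.append((u, v))
    node["children"] = [
        decomposition_tree(left, inside_left),
        decomposition_tree(right, inside_right),
    ]
    return node


def max_cut_per_level(root):
    """Largest cut among the internal nodes at each level of the tree."""
    levels = []
    frontier = [root]
    while frontier:
        internal = [t for t in frontier if t["children"]]
        if not internal:
            break
        levels.append(max(t["cut"] for t in internal))
        frontier = [child for t in internal for child in t["children"]]
    return levels


if __name__ == "__main__":
    path = [(i, i + 1) for i in range(7)]
    print(max_cut_per_level(decomposition_tree(list(range(8)), path)))
```

For a path, every split cuts a single edge; for a 4-by-4 mesh in row-major order, the upper levels cut a full row of edges, which is the kind of information the later embedding step would rely on.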

Figure 5: Packaging a complete binary tree.

It seemed to me at the time that Bhatt and Leighton had solved nearly
all the interesting open problems in VLSI layout theory. All new results in
the area would be little more than refinements of existing methods with no
more real insights into the nature of interconnectivity. I turned my attention
toward parallel computation, in which I had continued to be involved since
my work with H. T. Kung on systolic arrays [8].

In fact, I was very much a proponent of special-purpose parallel
computation over general-purpose parallel computation, largely as a result of my
work on VLSI layout theory. After all, as Kung and I had shown, and as
Kung has continued to forcefully demonstrate, many computations can be
performed efficiently on simple linear-area structures such as one- and
two-dimensional arrays. These special-purpose networks have the nice property
that they can be laid out so that processors are dense and packaging costs
are minimized. Moreover, for many problems, they offer speedup which is
linear in the number of processors in the systolic array.

General-purpose parallel computers, on the other hand, are typically
based on interconnection networks, such as hypercubes, that are very costly
for the computation they provide. For example, any hypercube network
embedded in area A has at most O(√A) processors. The processors are therefore
sparse in the embedding, and connections dominate the cost. Similar results
can be shown for a three-dimensional VLSI model. Only O(V^(2/3)) processors
of a hypercube network can fit in a volume V.
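To get a feel for how sparse the processors become, the bounds above can be evaluated with all constant factors set to 1. That choice of constant, and the function names, are assumptions made purely for illustration; the mesh count uses the linear-area mesh layout mentioned earlier.

```python
import math


def hypercube_processors_in_area(area):
    """O(sqrt(A)) bound with the constant taken as 1 (an assumption)."""
    return math.isqrt(area)


def mesh_processors_in_area(area):
    """A mesh has a linear-area layout, so about A processors fit (constant 1)."""
    return area


def hypercube_processors_in_volume(volume):
    """O(V**(2/3)) bound with the constant taken as 1 (an assumption)."""
    return round(volume ** (2 / 3))


if __name__ == "__main__":
    for size in (10**4, 10**6, 10**8):
        print(size,
              hypercube_processors_in_area(size),
              mesh_processors_in_area(size))
```

With a million units of area, the sketch allows about a thousand hypercube processors against about a million mesh processors.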

Hypercube networks do have a major advantage over many other networks
for parallel computing, however. They are universal: a hypercube on n
processors can simulate any n-processor bounded-degree network in O(lg n)
time. The simulation overhead is polylogarithmic (a polynomial of lg n), an
indication that the simulation is a parallel simulation. A polynomial overhead
in simulation is less interesting, since O(n) overhead is easily obtained by a
serial processor simulating each of the n processors in turn.

The proof that an n-processor hypercube is universal goes roughly as
follows. Suppose we have a bounded-degree network R with n processors.
Each processor can communicate with all its neighbors in unit time. The
hypercube can simulate the network, therefore, by sending at most a constant
number of messages from each processor, where each message contains the
information that travels on one of the interconnections in R. It turns out
that all messages can be routed on the hypercube to their destinations in O(lg n)
time [23].
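One ingredient of such a routing result is that any two hypercube nodes are at most lg n hops apart. The sketch below shows a simple greedy "bit-fixing" path, which corrects the differing address bits one at a time. It is an assumption of mine for illustration and not necessarily the scheme of the cited result; in particular, it says nothing about how congestion among many simultaneous messages is handled.

```python
def bit_fixing_path(src, dst, dim):
    """Nodes visited when differing address bits are corrected from low to high."""
    path = [src]
    current = src
    for bit in range(dim):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit   # cross the hypercube edge for this dimension
            path.append(current)
    return path


if __name__ == "__main__":
    print(bit_fixing_path(0b000, 0b101, 3))
```

Because each dimension is crossed at most once, no path has more than dim = lg n hops.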

The notion of universality, the ability of one machine to efficiently
simulate every machine in a class, is central to the origins of computer science.
A universal machine is the computer theorist’s idea of a general-purpose, as
opposed to multipurpose, machine. A universal machine can do the function
of any machine, just by programming it, or, in the case of parallel-processing
networks, just by routing messages. A universal machine may not be the
best machine for any given job, but it is never much worse than the best.
The universality theorem for hypercubes does not say that a hypercube is
the fastest network to build on n processors. What it says is that the fastest
special-purpose network for any given problem can’t be much faster.

From a VLSI theory standpoint, however, a special-purpose parallel
machine has a clear advantage over a universal parallel machine. Packaging its
network can cost much less. And although universality is a selling point,
our economy favors machines that are cheap and efficient, even if they are
not universal. (How many combination telephone-lawnmower-toothbrushes
have been sold recently?) Special-purpose networks for parallel computation
are much cheaper than hypercube networks. Thus, for a long time, I was
skeptical about the cost-effectiveness of general-purpose parallel computing.

I changed my mind, however, and became an advocate of general-purpose
parallel computing when I started to look more closely at the traditional
assumptions concerning universal networks. In fact, from a VLSI theory
perspective, I discovered that hypercubes are not really “universal” at all!
An n-processor hypercube may be able to efficiently simulate any n-processor
bounded-degree network, but if we normalize by area instead of by number of
processors, we discover that an area-A hypercube cannot simulate all area-A
networks efficiently. For example, since an area-A hypercube has only O(√A)
processors, it can’t simulate an area-A mesh, which has Θ(A) processors, in
polylogarithmic time. A network that is universal from a VLSI point of view
should be a network that for a given area can efficiently simulate any other
network of comparable area.

Figure 6: An area-universal fat-tree.

One such area-universal network is a fat-tree [16, 6], which is based on
Leighton’s tree-of-meshes graph. As shown in Figure 6, processors occupy
the leaves of the tree, and the meshes are replaced with switches. Unlike
a computer scientist’s traditional notion of a tree, a fat-tree is more like
a real tree in that it gets thicker further from the leaves. Local messages
can be routed within subtrees, like phone calls in a telephone exchange,
thereby requiring no bandwidth higher in the tree. The number of external
connections from a subtree with m processors is proportional to √m, which
is the perimeter of a region of area m. The area of the network is O(n lg² n),
which is nearly linear in the number n of processors. Thus, the processors
are packed densely in the layout.
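The square-root rule for channel sizes is easy to tabulate. In the sketch below, the constant of proportionality is taken as 1 and fractional capacities are rounded up; both choices, like the function name, are assumptions made for illustration, and n is restricted to a power of two so that subtrees halve evenly.

```python
import math


def area_universal_capacities(n):
    """capacities[h] is the wire count above a subtree holding 2**h processors."""
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    levels = n.bit_length() - 1
    # isqrt(m - 1) + 1 equals the ceiling of sqrt(m) for every integer m >= 1.
    return [math.isqrt(2 ** h - 1) + 1 for h in range(levels + 1)]


if __name__ == "__main__":
    print(area_universal_capacities(16))
    print(area_universal_capacities(1024))
```

The capacities roughly double every two levels, which is what lets the tree get thicker toward the root without the total wiring becoming much more than linear.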


Figure 7: Any area-n network R can be efficiently simulated by an
n-processor area-universal fat-tree.

Any network R that fits in a square of area n can be efficiently simulated
by an area-universal fat-tree on n processors. To perform the simulation,
we ignore the wires in R and map the processors of R to the processors of
the fat-tree in the natural geometric way, as shown in Figure 7. As in the
hypercube simulation, each wire of R is replaced by a message in the
fat-tree. If we look at any m-processor subtree of the fat-tree, it simulates at
most a region of area m in the layout of R. The number of wires that can
leave this area-m region in R’s layout is O(√m), and the fat-tree channel
connecting to the root of the subtree has Θ(√m) wires. Thus, the load factor
of the channel, the ratio of the number of messages to channel bandwidth, is
O(1). It turns out that there are routing algorithms [16, 6, 12] that effectively
guarantee that all messages are delivered in polylogarithmic time. (In fact,
the algorithms can deliver messages near optimally even if the load factor is
quite large.)
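The load-factor argument can be tried on a concrete case. The sketch below takes R to be a square mesh, maps its processors to fat-tree leaves with a bit-interleaved (Z-order) numbering so that every subtree covers a compact rectangle, sends one message per mesh wire, and reports the worst ratio of messages to channel capacity. The Z-order mapping, the capacity rule (ceiling of √m with constant 1), and all function names are assumptions for illustration; the text itself only says "the natural geometric way".

```python
import math


def z_order(x, y, bits):
    """Interleave the bits of x and y to get a fat-tree leaf index."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


def mesh_messages(side):
    """One message per mesh wire, as (source leaf, destination leaf) pairs."""
    bits = side.bit_length() - 1          # side is assumed a power of two
    messages = []
    for x in range(side):
        for y in range(side):
            here = z_order(x, y, bits)
            if x + 1 < side:
                messages.append((here, z_order(x + 1, y, bits)))
            if y + 1 < side:
                messages.append((here, z_order(x, y + 1, bits)))
    return messages


def load_factor(n, messages):
    """Worst ratio of messages crossing a channel to that channel's capacity."""
    levels = n.bit_length() - 1
    worst = 0.0
    for h in range(levels):               # channels above subtrees of 2**h leaves
        capacity = math.isqrt(2 ** h - 1) + 1   # ceiling of sqrt(2**h)
        crossing = {}
        for src, dst in messages:
            a, b = src >> h, dst >> h
            if a != b:                    # message leaves a's subtree, enters b's
                crossing[a] = crossing.get(a, 0) + 1
                crossing[b] = crossing.get(b, 0) + 1
        if crossing:
            worst = max(worst, max(crossing.values()) / capacity)
    return worst


if __name__ == "__main__":
    print(load_factor(64, mesh_messages(8)))
```

For the 8-by-8 mesh this sketch reports a load factor of 4.0, a small constant independent of the subtree size, in line with the O(1) claim.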

Similar universality theorems can be proved for three-dimensional VLSI
models using volume-universal fat-trees. For a fat-tree to be universal for
volume, however, the channel capacities must be selected differently from
those in an area-universal network. Whereas the average growth rate of
channels in an area-universal fat-tree is √2, the average growth rate in a
volume-universal fat-tree is ∛4.
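Both growth rates follow from the same observation: moving one level up the tree doubles the number of processors m, and the channel must track the boundary of the region holding them, which scales as m^(1/2) in the plane and, by the analogous surface-area argument that I am assuming here, as m^(2/3) in space. A minimal numeric check, with a hypothetical function name:

```python
def channel_growth_rate(dimensions):
    """Factor by which boundary size grows when the enclosed size doubles."""
    return 2 ** ((dimensions - 1) / dimensions)


if __name__ == "__main__":
    print(channel_growth_rate(2))   # area-universal: square root of 2
    print(channel_growth_rate(3))   # volume-universal: cube root of 4
```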


In practice, of course, no mathematical rule governs interconnect
technology. Most networks that have been proposed for parallel processing, such
as meshes and hypercubes, are inflexible when it comes to adapting their
topologies to the arbitrary bandwidths provided by packaging technology.
The growth in channel bandwidth of a fat-tree, however, is not constrained
to follow a prescribed mathematical formula. The channels of a fat-tree
can be adapted to effectively utilize whatever bandwidths the technology
can provide and which make engineering sense in terms of cost and
performance. Figure 8 shows one variant of a fat-tree composed of two kinds
of small switches: a three-connection switch and a four-connection switch.
By choosing one of these two kinds of switches at each level of the fat-tree,
the bandwidths of channels can be adjusted. If the three-connection switch
is always selected, an ordinary complete binary tree results. If the
four-connection switch is always selected, a butterfly network, which is a relative
of a hypercube, results. By suitably mixing these two kinds of switches, a
fat-tree that falls between these two extremes can be constructed that closely
matches the bandwidths provided by the interconnect technology.
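The effect of the switch choice on channel bandwidth can be sketched as follows. I am assuming that a three-connection switch has two child links and one parent link, so the channel width is unchanged at that level, and that a four-connection switch has two child links and two parent links, so the width doubles. That reading of Figure 8, and the function name, are my assumptions.

```python
def channel_widths(switch_choices):
    """Channel width above each level, starting from width 1 at a processor.

    `switch_choices` lists 3 or 4 for each level, from the leaves upward.
    """
    widths = [1]
    for choice in switch_choices:
        if choice == 3:
            widths.append(widths[-1])        # one parent link per switch
        elif choice == 4:
            widths.append(2 * widths[-1])    # two parent links per switch
        else:
            raise ValueError("each level must use a 3- or 4-connection switch")
    return widths


if __name__ == "__main__":
    print(channel_widths([3, 3, 3, 3]))   # ordinary binary tree
    print(channel_widths([4, 4, 4, 4]))   # butterfly-like growth
    print(channel_widths([4, 3, 4, 3]))   # a mixture between the extremes
```

Alternating the two switch types doubles the width every second level, which averages out to roughly the √2 growth per level of an area-universal fat-tree.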

The notion of locality exploited by fat-trees is but one of three such
notions that arise in the engineering of a parallel computer. The most
basic notion of locality is exemplified by wire delay and measured in distance.
Communication is speed-of-light limited. If this notion of locality dominates,
the nearest-neighbor communication provided by a three-dimensional mesh is
the best one can hope for. For many systems, however, wire delay is dominated
by the time it takes for logic circuits to compute their functions. The second
notion of locality is exemplified by levels of logic circuits and measured in
gate delays. Communication time is essentially limited by the number of
switches a message passes through. From this point of view, structures with
small diameters, such as hypercubes, seem ideal. In a routing network,
however, a heavy load of messages can cause congestion, and the time it takes to
resolve this congestion can dominate both wire and gate delays. Congestion
is especially likely to occur in networks that make efficient use of
packaging technology. The last notion of locality is exemplified by the congestion
of messages leaving a subsystem and measured by load factor. From this
standpoint, fat-trees offer provably good performance by a general-purpose
network that can be packaged efficiently. Recent work [17] has shown that
efficient parallel algorithms can be designed for this kind of network, as well.

Figure 8: A scalable fat-tree.

Whatever the point of view, however, all three notions of locality must
guide the engineering and programming of very large machines. There are
problems in the sciences that cry out for massive amounts of computation,
most of which exhibit locality naturally: problems in astronomy, such as
galaxy simulation; problems in biology, such as the combinatorics of DNA
sequencing; problems in economics, such as market prediction; problems in
aerospace, such as fluid-flow simulation; problems in earth, atmospheric,
and ocean sciences, such as earthquake and weather prediction. To address
these problems effectively, very large parallel computers must be constructed.
Some of these computers may even be “building sized.” To construct and
program such large machines, however, locality must be exploited, and
computer engineers must come to grips with the lessons of VLSI theory.